A. The Black Box Problem

You cannot ship what you cannot trace

Agenda

  • A. The Black Box Problem — Why agent debugging is different (~10 min)
  • B. Structured Tracing — Seeing inside the agent’s brain (~15 min)
  • C. Loop Detection — Catching agents that spin in circles (~15 min)
  • D. Cost Tracking — Monitoring spend per query (~10 min)
  • E. DeepEval for Agents — Production-grade evaluation (~20 min)
  • F. Wrap-up — Key takeaways & lab preview (~5 min)

Traditional vs Agent Debugging

Traditional Debugging            Agent Debugging
Same input -> same output        Same input -> different outputs
Stack trace shows exact failure  Failure emerges over multiple steps
Unit tests with assertEqual      Subjective quality evaluation
Fixed cost per execution         Variable cost (1-50 LLM calls)
Errors crash the program         Errors may be silently “reasoned away”

The Core Problem

An agent might technically succeed (no crash, produces an answer) while being completely wrong. Or it might spend $0.50 on a $0.02 question. You can’t fix what you can’t see.

Five Ways Agents Fail

  1. Prompt ambiguity — the agent didn’t understand the task
  2. Tool misuse — right tool, wrong arguments
  3. Formatting errors — tried to format JSON, failed
  4. Infinite loops — kept searching “Python” 50 times
  5. Hallucination — confidently lied about a search result it never saw

Production Tip: Never deploy an agent without tracing. Costs can explode if an agent enters an infinite loop.

B. Structured Tracing

Every step, captured and queryable

What a Good Trace Captures

For every step in the agent loop:

  • Trace ID (unique per request)
  • Step number
  • Agent’s reasoning (LLM content)
  • Tool calls (name + arguments)
  • Tool results
  • Token usage (input/output)
  • Cost per step (USD)
  • Duration (milliseconds)

Example step from a trace:

[Step 3] (450ms, $0.0042)
  Reasoning: "The search returned results about Paris. Let me..."
  Tool: search({"query": "population of Paris 2024"})
  Result: "The population of Paris is approximately 2.1 million..."

The Trace Data Model

from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCallRecord:
    tool_name: str
    tool_input: dict
    tool_output: str
    duration_ms: float

@dataclass
class AgentStep:
    step_number: int
    reasoning: Optional[str]
    tool_calls: list[ToolCallRecord]
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    trace_id: str
    agent_name: str
    steps: list[AgentStep]
    status: str  # "running", "completed", "failed", "loop_detected"
    total_cost_usd: float = 0.0

The AgentTracer Class

class AgentTracer:
    def start_trace(self, agent_name, query, model) -> str:
        """Start a new trace. Returns trace_id."""

    def log_step(self, trace_id, step: AgentStep):
        """Log a completed step — accumulates tokens and cost."""

    def end_trace(self, trace_id, output, status="completed"):
        """Mark trace as complete."""

    def get_trace_json(self, trace_id) -> str:
        """Export trace as JSON for debugging."""

    def print_summary(self, trace_id):
        """Human-readable trace summary."""

In production, this would send data to Datadog, LangSmith, or Arize. For this course, we log to console and export to JSON.
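The skeleton above can be fleshed out into a minimal in-memory implementation. This is a sketch, not the course's exact code: the dataclasses are slimmed to the fields used here, and as noted, production code would ship these records to a backend instead of a dict.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class AgentStep:
    # Slimmed-down version of the AgentStep dataclass above
    step_number: int
    reasoning: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    trace_id: str
    agent_name: str
    query: str
    model: str
    steps: list = field(default_factory=list)
    status: str = "running"
    total_cost_usd: float = 0.0
    output: Optional[str] = None

class AgentTracer:
    """In-memory tracer; production code would ship records to a backend."""

    def __init__(self):
        self._traces = {}

    def start_trace(self, agent_name: str, query: str, model: str) -> str:
        # Short random id is enough for local debugging
        trace_id = uuid.uuid4().hex[:8]
        self._traces[trace_id] = Trace(trace_id, agent_name, query, model)
        return trace_id

    def log_step(self, trace_id: str, step: AgentStep) -> None:
        # Accumulate cost as steps arrive, as described in the skeleton
        trace = self._traces[trace_id]
        trace.steps.append(step)
        trace.total_cost_usd += step.cost_usd

    def end_trace(self, trace_id: str, output: str, status: str = "completed") -> None:
        trace = self._traces[trace_id]
        trace.output = output
        trace.status = status

    def get_trace_json(self, trace_id: str) -> str:
        # dataclasses.asdict recursively converts nested steps
        return json.dumps(asdict(self._traces[trace_id]), indent=2)
```

The same `trace_id` threads through every call, so concurrent requests stay separated.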

Trace Output Example

============================================================
TRACE SUMMARY: a1b2c3d4
============================================================
Agent: react_agent | Model: gpt-4o
Status: completed
Query: What is the population of the capital of France?

Steps (3 total):
------------------------------------------------------------
  Step 1: 1200ms, $0.0085 -> search
  Step 2: 980ms,  $0.0062 -> search
  Step 3: 450ms,  $0.0031 (no tools)
------------------------------------------------------------
Total Tokens: 2847 input + 312 output = 3159
Total Cost:   $0.0178
Total Time:   2630ms

Answer: The population of Paris is approximately 2.1 million.
============================================================

C. Loop Detection

Catching agents that spin in circles

The Infinite Loop Problem

An agent calls search("python tutorial"), gets a result, then calls search("python tutorial") again. And again. And again.

Why it happens:

  • The model doesn’t “understand” the result satisfied the query
  • The tool returns an error, and the agent retries with the same arguments
  • The prompt is ambiguous, so the agent keeps trying variations

Cost Impact

A looping agent can burn through hundreds of API calls at $0.01-0.05 each. A single bad query can cost $5+ before max_steps kicks in.

Three Detection Strategies

graph LR
    TC["Tool Call"] --> E["Exact Match<br/>Same tool + same args"]
    TC --> F["Fuzzy Match<br/>Similar args (Jaccard)"]
    TC --> S["Output Stagnation<br/>Similar outputs repeated"]

    style E fill:#FF7A5C,stroke:#1C355E,color:#1C355E
    style F fill:#9B8EC0,stroke:#1C355E,color:#1C355E
    style S fill:#00C9A7,stroke:#1C355E,color:#1C355E

Strategy 1: Exact Match

Same tool + identical arguments repeated N times.

# Before executing each tool call:
count = tool_history.count((current_tool_name, current_args_string))

if count >= exact_threshold:
    return LoopDetected(confidence=1.0)  # 100% — always a loop

tool_history.append((current_tool_name, current_args_string))

Confidence: 100% — this is always a loop.

Strategy 2: Fuzzy Match

Similar (but not identical) tool calls — catches minor rephrasing.

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

"python tutorial basics" ∩ "basics python tutorial" = {python, tutorial, basics}
→ |intersection| / |union| = 3/3 = 1.0  ← loop detected!

"python tutorial" ∩ "python guide" = {python}
→ 1 / 3 = 0.33  ← different query, not a loop

  • Threshold: 0.8 — catches rephrasings, ignores truly different queries
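The check fits in a few lines. A sketch over whitespace-split tokens; a real detector might also normalize punctuation:

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two tool-argument strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty argument strings are identical
    return len(set_a & set_b) / len(set_a | set_b)

# Matches the worked examples above:
jaccard_similarity("python tutorial basics", "basics python tutorial")  # 1.0
jaccard_similarity("python tutorial", "python guide")                   # ≈ 0.33
```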

Strategy 3: Output Stagnation

The agent keeps producing similar responses — not making progress.

# Compare pairwise similarity of last N agent outputs
recent_outputs = output_history[-stagnation_window:]

if avg_pairwise_similarity(recent_outputs) >= threshold:
    return LoopDetected(strategy="stagnation")
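The snippet leans on an avg_pairwise_similarity helper that isn't shown. One minimal version, reusing the token-set Jaccard idea from the fuzzy-match strategy (an assumption; the course code may use a different similarity):

```python
from itertools import combinations

def avg_pairwise_similarity(outputs: list) -> float:
    """Mean Jaccard similarity over all pairs of recent agent outputs."""
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0  # fewer than two outputs: no evidence of stagnation
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```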

The Circuit Breaker Pattern

Combine loop detection with the agent loop to break infinite cycles:

for step in range(max_steps):
    # ... get LLM response, extract tool calls ...

    for tool_call in tool_calls:
        # Check BEFORE executing
        loop_check = loop_detector.check_tool_call(
            tool_call.name, str(tool_call.arguments)
        )

        if loop_check.is_looping:
            # Inject warning into conversation instead of executing
            messages.append({
                "role": "tool",
                "content": f"LOOP DETECTED: {loop_check.message}"
            })
            break  # Skip remaining tool calls this step

The agent receives the loop warning as if it were a tool result — it can then change strategy.

D. Cost Tracking

Monitoring spend per query

Why Track Cost?

Without tracking:

  • “Our AI bill was $500 this month”
  • “Some queries are expensive but we don’t know which”
  • No budget enforcement

With tracking:

  • “Query X cost $2.30 (15 steps)”
  • “Average cost: $0.12/query”
  • Budget alerts per query

Cost Tracking with LiteLLM

LiteLLM provides built-in cost calculation:

from litellm import completion, completion_cost

response = completion(model="gpt-4o", messages=messages)

# Get cost from the response
cost = completion_cost(completion_response=response)
print(f"This step cost: ${cost:.4f}")

Integrate with the tracer:

tracer.log_cost(trace_id, step_number=step, cost_usd=cost)

# At the end of the trace:
# Total Cost: $0.0178  (sum of all steps)

Setting Budget Limits

class BudgetExceededError(Exception):
    """Raised when a query exceeds its spending budget."""

class CostTracker:
    def __init__(self, budget_limit_usd: float = 1.0):
        self.budget_limit = budget_limit_usd
        self.total_spent = 0.0

    def add_cost(self, cost: float) -> bool:
        self.total_spent += cost
        if self.total_spent > self.budget_limit:
            raise BudgetExceededError(
                f"Query cost ${self.total_spent:.2f} "
                f"exceeds budget of ${self.budget_limit:.2f}"
            )
        return True

Production Tip: Set per-query budgets ($1-5) AND daily budgets ($50-500). A single runaway agent should never bankrupt your project.
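One way to layer the two limits from the tip. A sketch with hypothetical class and method names; BudgetExceededError is the exception raised on the CostTracker slide:

```python
class BudgetExceededError(Exception):
    """Raised when a query or the day's total exceeds its budget."""

class LayeredBudget:
    """Enforces a per-query cap and a running daily cap together."""

    def __init__(self, per_query_usd: float = 2.0, daily_usd: float = 100.0):
        self.per_query_usd = per_query_usd
        self.daily_usd = daily_usd
        self.spent_today = 0.0  # reset by a daily scheduler in practice

    def charge_query(self, query_cost_usd: float) -> None:
        # Per-query cap: one runaway query trips immediately
        if query_cost_usd > self.per_query_usd:
            raise BudgetExceededError(
                f"Query cost ${query_cost_usd:.2f} exceeds "
                f"per-query budget of ${self.per_query_usd:.2f}"
            )
        # Daily cap: many cheap queries can still add up
        self.spent_today += query_cost_usd
        if self.spent_today > self.daily_usd:
            raise BudgetExceededError(
                f"Daily spend ${self.spent_today:.2f} exceeds "
                f"daily budget of ${self.daily_usd:.2f}"
            )
```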

E. DeepEval for Agents

Production-grade evaluation framework

The Evaluation Problem

How do you know if your agent is any good?

  • Manual review doesn’t scale
  • String matching can’t evaluate free-text answers
  • Unit tests check format, not quality
  • Custom LLM judges are hard to maintain and calibrate

The Solution: DeepEval

An open-source framework with 50+ research-backed metrics for LLM and agent evaluation. Integrates with pytest for CI/CD workflows.

Why DeepEval for Agents?

Traditional Metrics Fail

  • BLEU/ROUGE: word overlap, not meaning
  • Accuracy: binary, no nuance
  • Precision/Recall: not for open-ended outputs

DeepEval Approach

  • LLM-as-Judge with calibrated prompts
  • Agentic metrics: task completion, tool correctness
  • Component-level evaluation via tracing

DeepEval Agent Metrics

Metric                 What It Measures
TaskCompletionMetric   Did the agent accomplish the intended task?
ToolCorrectnessMetric  Were the right tools called with correct args?
StepEfficiencyMetric   Was the execution path optimal?
PlanQualityMetric      Was the agent’s plan logical and complete?
PlanAdherenceMetric    Did the agent follow its own plan?

LLM Tracing with @observe

DeepEval uses decorators to trace agent components:

from deepeval.tracing import observe
from deepeval.metrics import TaskCompletionMetric

@observe(metrics=[TaskCompletionMetric(threshold=0.7)])
async def run_agent(self, query: str) -> str:
    # Agent loop executes here
    # DeepEval captures: steps, tool calls, reasoning
    return final_answer

Each traced function becomes a span — a unit for evaluation.

Tracing Agent Components

from deepeval.tracing import observe, update_current_span

@observe(type="agent")
class ReactAgent:
    @observe(type="llm")
    def call_llm(self, messages): ...
    
    @observe(type="tool")
    def execute_tool(self, name, args): ...

The trace captures the full execution graph for debugging and evaluation.

Evaluation with Test Cases

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric, ToolCorrectnessMetric

test_case = LLMTestCase(
    input="What is the population of Paris?",
    expected_output="approximately 2.1 million",
    tools_called=[
        ToolCall(name="search", args={"query": "population of Paris"})
    ]
)

evaluate(
    test_cases=[test_case],
    metrics=[TaskCompletionMetric(), ToolCorrectnessMetric()]
)

Evaluation Results Dashboard

======================================================================
DEEPEVAL EVALUATION RESULTS (5 test cases)
======================================================================
Test Case                      Task Comp  Tool Correct  Step Eff  Overall
---------------------------------------------------------------------
What is the capital of France?    PASS        PASS         -      PASS
Compare Python and JavaScript     PASS        PASS       0.85     PASS
Latest AI research trends         FAIL        PASS       0.62     FAIL
Population of Tokyo metro         PASS        PASS       0.91     PASS
Explain quantum computing         PASS        FAIL       0.78     FAIL
---------------------------------------------------------------------
PASS RATE: 60% (3/5)
Average Step Efficiency: 0.79
======================================================================

Building an Evaluation Dataset

{
  "test_cases": [
    {
      "input": "What is the population of Tokyo?",
      "expected_output": "approximately 14 million",
      "expected_tools": ["search"],
      "metadata": {"category": "factual", "difficulty": "easy"}
    },
    {
      "input": "Compare renewable energy policies of EU and US",
      "expected_output": "The EU has committed to... while the US...",
      "expected_tools": ["search", "search"],
      "metadata": {"category": "comparison", "difficulty": "hard"}
    }
  ]
}

CI/CD Integration

# test_agent.py - runs in your CI pipeline
import pytest
from deepeval import assert_test
from deepeval.metrics import TaskCompletionMetric

def test_agent_factual_queries():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        expected_output="Paris"
    )
    assert_test(test_case, [TaskCompletionMetric(threshold=0.8)])

Best Practice

Maintain a living eval dataset (20-50 examples). Run evals on every PR. “It feels better” becomes “task completion improved from 72% to 89%.”

Embedding-Based Metrics

When LLM-as-Judge is too expensive or slow:

Metric                    What It Measures                                   Cost
RetrievalRelevanceMetric  Avg cosine similarity: query ↔ retrieved chunks    $0
ContextCoverageMetric     How well retrieved chunks cover expected contexts  $0
AnswerSimilarityMetric    Cosine similarity: expected ↔ actual answer        $0

from evaluation.embedding_metrics import get_embedding_metrics

relevance, coverage, answer = get_embedding_metrics(
    relevance_threshold=0.25,
    coverage_threshold=0.4
)

result = relevance.evaluate(query_embedding, retrieved_embeddings)
# Offline, no API calls, instant feedback
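All three metrics bottom out in cosine similarity between embedding vectors. For reference, a dependency-free version (real code would use numpy over batched arrays):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # 1.0 — identical direction
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 0.0 — orthogonal
```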

Test Case Management

Structure your evaluation data:

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "expected_tools": ["search"],
        "category": "factual",
        "difficulty": "easy"
    },
    {
        "input": "Compare the population of Tokyo and Paris",
        "expected_output": "Tokyo: 14M, Paris: 2.1M",
        "expected_tools": ["search", "search"],
        "category": "multi_step"
    }
]

Load and save with helper functions:

from evaluation.test_cases import load_test_cases, save_test_cases

cases = load_test_cases("test_cases.json")
save_test_cases(new_cases, "updated_test_cases.json")

Evaluation Runner Pattern

Production-ready evaluation workflow:

# 1. Run evaluation from file
results = run_evaluation_from_file(
    agent, "test_cases.json", threshold=0.7
)

# 2. Generate human-readable report
report = create_evaluation_report(results, output_file="report.md")

# 3. CI/CD integration
# python run_eval.py --test-cases test_cases.json --output report.md

CLI for Evaluation

Use Typer to build evaluation CLIs with --model, --threshold, and --output flags. Run evals in CI pipelines before every merge.
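Typer code isn't shown on this slide; as a stdlib stand-in, the same flag surface looks like this with argparse (the commented calls assume the run_evaluation_from_file and create_evaluation_report helpers from the runner slide, plus an agent instance):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI with the --model, --threshold, and --output flags described above."""
    parser = argparse.ArgumentParser(description="Run agent evaluation")
    parser.add_argument("--model", default="gpt-4o")
    parser.add_argument("--threshold", type=float, default=0.7)
    parser.add_argument("--test-cases", default="test_cases.json")
    parser.add_argument("--output", default="report.md")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    # results = run_evaluation_from_file(agent, args.test_cases, threshold=args.threshold)
    # create_evaluation_report(results, output_file=args.output)
```

Typer gives you the same flags from type-annotated function parameters, plus generated --help; the wiring to the runner is identical.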

DeepEval vs Custom Evaluation

Aspect Custom LLM-as-Judge DeepEval
Metric quality Depends on your prompt Research-backed, calibrated
Maintenance You own the prompts Community + Confident AI
Agentic metrics Build from scratch 6+ agent-specific metrics
Tracing Manual instrumentation @observe decorator
CI/CD integration Custom scripts pytest native

F. Wrap-up

Key Takeaways

  1. Tracing is non-negotiable — every step, every tool call, every cost
  2. Loop detection uses three strategies: exact, fuzzy, and stagnation
  3. Circuit breakers inject loop warnings into the conversation
  4. Cost tracking prevents runaway spending with per-query budgets
  5. DeepEval provides production-grade metrics and tracing for agent evaluation
  6. Embedding-based metrics enable offline, cost-free evaluation
  7. Test case management with JSON datasets scales evaluation efforts
  8. CLI runners make evaluation CI/CD-ready

Lab Preview: The Broken Agent

Step 1: Instrumentation

  • Inject AgentTracer into ReactAgent
  • Run a query that triggers a loop

Step 2: Diagnosis

  • Read the trace JSON/logs
  • Identify the repeating tool calls

Step 3: The Fix

  • Implement a circuit breaker
  • Add loop detection to the agent loop

Step 4: Verify with DeepEval

  • Run evaluation with run_eval.py CLI
  • Load test cases from test_cases.json
  • Generate evaluation report

Time: 75 minutes

Questions?

Session 3 Complete